Add budget-window batch economy endpoint #3387
The axes code path silently discarded all exceptions (`pass`), causing variables like NJ gross income to return null with no error trace. The code now logs the full traceback via `logging.exception()`.

Fixes #3322

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Codecov Report

❌ Patch coverage is

```diff
@@            Coverage Diff             @@
##           master    #3387      +/-   ##
==========================================
+ Coverage   72.20%   72.79%    +0.59%
==========================================
  Files          56       56
  Lines        2425     2827      +402
  Branches      423      514       +91
==========================================
+ Hits         1751     2058      +307
- Misses        595      649       +54
- Partials       79      120       +41
==========================================
```
I think it is dangerous for request-serving code to issue schema-management SQL. Separately, the new caching/coordination flow feels unnecessarily complex for what should be a standard cache: using DB rows plus advisory locks plus provisional execution IDs introduces a lot of race-prone state management. Please redo this around Redis instead of SQL locking, so we have a conventional cache / in-flight coordination mechanism rather than custom lock orchestration in request code.
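To make the Redis suggestion concrete, here is a minimal sketch of the conventional pattern: a canonical cache key derived from the request payload, plus an atomic "set if not exists with TTL" claim so only the first concurrent request starts the computation. The code below uses a small in-memory stand-in so it runs anywhere; real code would call redis-py's `r.set(key, job_id, nx=True, ex=ttl)`, which has the same semantics. All names here (`FakeRedis`, `claim_in_flight`, the key prefix) are illustrative, not part of the existing codebase.

```python
import hashlib
import json
import time

class FakeRedis:
    """In-memory stand-in for Redis SET NX EX semantics (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._data.get(key)
        if current and current[1] is not None and current[1] < now:
            current = None  # treat expired entries as absent
        if nx and current is not None:
            return None  # NX: key already held, claim fails
        self._data[key] = (value, (now + ex) if ex else None)
        return True

def cache_key(payload: dict) -> str:
    """Canonical key: hash of the payload serialized with sorted keys,
    so semantically identical requests map to the same cache entry."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "econ:" + hashlib.sha256(blob.encode()).hexdigest()

def claim_in_flight(r, payload: dict, job_id: str, ttl: int = 600) -> bool:
    """True if this request won the right to start the computation;
    False if another request already claimed this payload (poll its job)."""
    return bool(r.set(cache_key(payload) + ":inflight", job_id, nx=True, ex=ttl))

r = FakeRedis()
payload = {"reform": 1, "year": 2026}
assert claim_in_flight(r, payload, "job-A") is True   # first caller starts work
assert claim_in_flight(r, payload, "job-B") is False  # second caller polls job-A
```

The TTL doubles as stale-claim cleanup: if the worker dies, the claim simply expires, replacing the manual stale-claim handling in the current design.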
I think this should be reworked so the orchestration lives in the simulation API repo rather than in API v1.
Benefits:
- Much less request-time table editing and coordination logic in API v1. We would not need request-serving code to manage provisional rows, execution ID promotion, stale-claim cleanup, or schema-related DB behavior.
- Better scalability. The simulation API repo is already the natural boundary for spawning and tracking simulation work, and Modal can scale the worker side horizontally for us.
- Fewer race-condition risks. The current DB-lock/provisional-claim flow is a lot of custom concurrency machinery for what is essentially a batch execution problem.
- More leverage from Modal. We already get async job spawning, polling by job ID, worker isolation, and container scaling. We should use that instead of rebuilding orchestration in Flask request code.
- Easier eventual integration into API v2 alpha. If the batch orchestration lives in the simulation API repo, API v1 and API v2 alpha can both consume the same backend capability instead of duplicating orchestration logic in each API layer.
Implementation outline:
In the simulation API repo:
- Add a batch endpoint for this workflow, e.g. `/simulate/economy/comparison/batch` or `/simulate/economy/budget-window`.
- The request should accept either:
  - a base simulation payload plus `start_year`, `window_size`, and `max_parallel`, or
  - a fully expanded list of yearly payloads plus `max_parallel`.
- On submission, create one parent batch job and return a single batch job ID immediately.
- The batch job should fan out child yearly simulations as separate Modal executions, up to the requested parallelism limit.
- Track child job IDs and child statuses under the parent batch job.
- Expose polling for the parent batch job so the caller can retrieve:
- overall status
- progress
- completed years
- running years
- queued years
- failed years / error message
- final aggregated annual impacts and totals once complete
- The aggregation logic for the budget window should also live there, so the backend that owns the fan-out also owns the child-job status and final merged result.
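The parent/child shape described above can be sketched as follows. This is a local illustration only: it uses a thread pool as a stand-in for Modal's spawn-and-poll-by-job-ID workflow, and every name (`BatchJob`, `simulate_year`, the status fields) is hypothetical rather than an existing API.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

def simulate_year(year: int) -> dict:
    # Placeholder for one child yearly simulation (would be a Modal execution).
    return {"year": year, "budgetary_impact": 100.0 + year}

@dataclass
class BatchJob:
    """Parent batch job: fans out child yearly simulations and owns
    their statuses plus the final merged result."""
    years: list
    max_parallel: int
    results: dict = field(default_factory=dict)
    errors: dict = field(default_factory=dict)

    def run(self):
        # Fan out children, capped at the requested parallelism limit.
        with ThreadPoolExecutor(max_workers=self.max_parallel) as pool:
            futures = {pool.submit(simulate_year, y): y for y in self.years}
            for fut, year in futures.items():
                try:
                    self.results[year] = fut.result()
                except Exception as exc:
                    self.errors[year] = str(exc)

    def status(self) -> dict:
        # What the polling endpoint would expose for the parent job.
        done = sorted(self.results)
        return {
            "completed_years": done,
            "failed_years": sorted(self.errors),
            "progress": len(done) / len(self.years),
            # Aggregation lives with the fan-out, per the proposal above.
            "total_impact": sum(
                r["budgetary_impact"] for r in self.results.values()
            ),
        }

batch = BatchJob(years=list(range(2026, 2031)), max_parallel=3)
batch.run()
assert batch.status()["progress"] == 1.0
```

In the real service the parent would return its batch job ID immediately and the children would run asynchronously; the synchronous `run()` here just keeps the sketch self-contained.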
In API v1:
- Keep the budget-window route and request parsing.
- Keep the budget-window response contract, since this PR is already moving toward the right shape for the frontend.
- Replace the current per-year orchestration with a thin adapter:
- build one batch request
- submit it to the simulation API repo
- store/poll one parent batch job ID
  - map the batch status/result into the existing `BudgetWindowEconomicImpactResult` response
- Remove the request-time thread pool, DB advisory locks, provisional execution IDs, stale-claim handling, and per-year `reform_impact` coordination.
- If caching is still needed in API v1, use Redis as a standard cache:
- cache completed batch results by a canonical request key
- optionally cache “batch job currently in progress for this key”
- do not use SQL locking as the cache coordination mechanism
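The thin-adapter role for API v1 can be sketched in a few lines: build one batch request, submit it, poll the single parent job ID, and map the result into the existing response contract. `SimulationAPIClient` and its methods are stand-ins for the proposed simulation API endpoint, and the output fields are illustrative, not the actual `BudgetWindowEconomicImpactResult` schema.

```python
import time

class SimulationAPIClient:
    """Stub standing in for the proposed batch endpoint in the
    simulation API repo (hypothetical interface)."""

    def submit_batch(self, payload, start_year, window_size, max_parallel):
        self._years = list(range(start_year, start_year + window_size))
        return "batch-123"  # one parent batch job ID, returned immediately

    def poll(self, batch_id):
        # A real poll would report in-progress states; this stub completes
        # instantly so the sketch is runnable.
        return {
            "status": "complete",
            "annual_impacts": {y: 1.0 for y in self._years},
        }

def budget_window_impact(client, payload, start_year, window_size,
                         max_parallel=5, poll_interval=0.0):
    """Thin adapter: one submit, one poll loop, one mapping step."""
    batch_id = client.submit_batch(payload, start_year, window_size,
                                   max_parallel)
    while True:
        result = client.poll(batch_id)
        if result["status"] in ("complete", "error"):
            break
        time.sleep(poll_interval)
    # Map into the budget-window response shape (field names illustrative).
    return {
        "annual_impacts": result["annual_impacts"],
        "total": sum(result["annual_impacts"].values()),
    }

out = budget_window_impact(SimulationAPIClient(), {"reform": 1}, 2026, 5)
assert out["total"] == 5.0
```

Everything stateful (child statuses, retries, aggregation) stays behind the client, which is exactly what removes the thread pool and lock machinery from the request path.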
That would leave API v1 as a thin HTTP adapter and move the concurrency/orchestration concerns into the simulation API repo, where Modal is already doing the real execution work. It should be simpler to reason about, easier to scale, and substantially less race-condition-prone than the current lock-based design.
If this direction makes sense, I’d be happy to take it on in a follow-up. Flagging @MaxGhenis.
Companion app change: PolicyEngine/policyengine-app-v2#930